[WIP] Secure HDFS Support #373

Conversation
…log` (Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572) ## What changes were proposed in this pull request? Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table. To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows: Spark at bdc8153, `SHOW PARTITIONS table1`, times in seconds: 7.901 3.983 4.018 4.331 4.261 Spark at bdc8153, `SHOW PARTITIONS table2` (Timed out after 10 minutes with a `SocketTimeoutException`.) Spark at this PR, `SHOW PARTITIONS table1`, times in seconds: 3.801 0.449 0.395 0.348 0.336 Spark at this PR, `SHOW PARTITIONS table2`, times in seconds: 5.184 1.63 1.474 1.519 1.41 Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master. This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x. ## How was this patch tested? I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions. Author: Michael Allman <[email protected]> Closes apache#15998 from mallman/spark-18572-list_partition_names. (cherry picked from commit 772ddbe) Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? Right now ForeachSink creates a new physical plan, so StreamExecution cannot retrieve metrics and the watermark. This PR changes ForeachSink to manually convert InternalRows to objects without creating a new plan. ## How was this patch tested? `test("foreach with watermark: append")`. Author: Shixiong Zhu <[email protected]> Closes apache#16160 from zsxwing/SPARK-18721. (cherry picked from commit 7863c62) Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request? I jumped the gun on merging apache#16120, and missed a tiny potential problem. This PR fixes that by changing a val into a def; this should prevent potential serialization/initialization weirdness from happening. ## How was this patch tested? Existing tests. Author: Herman van Hovell <[email protected]> Closes apache#16170 from hvanhovell/SPARK-18634. (cherry picked from commit 381ef4e) Signed-off-by: Herman van Hovell <[email protected]>
## What changes were proposed in this pull request? Many Spark developers often want to test the runtime of some function in interactive debugging and testing. This patch adds a simple time function to SparkSession: ``` scala> spark.time { spark.range(1000).count() } Time taken: 77 ms res1: Long = 1000 ``` ## How was this patch tested? I tested this interactively in spark-shell. Author: Reynold Xin <[email protected]> Closes apache#16140 from rxin/SPARK-18714. (cherry picked from commit cb1f10b) Signed-off-by: Herman van Hovell <[email protected]>
…tructured Streaming log formats ## What changes were proposed in this pull request? To be able to restart StreamingQueries across Spark versions, we have already made the logs (offset log, file source log, file sink log) use JSON. We should add tests with actual JSON files in Spark such that any incompatible changes in reading the logs are immediately caught. This PR adds tests for FileStreamSourceLog, FileStreamSinkLog, and OffsetSeqLog. ## How was this patch tested? new unit tests Author: Tathagata Das <[email protected]> Closes apache#16128 from tdas/SPARK-18671. (cherry picked from commit 1ef6b29) Signed-off-by: Shixiong Zhu <[email protected]>
…es in pyspark package. ## What changes were proposed in this pull request? Since we already include the python examples in the pyspark package, we should include the example data with it as well. We should also include the third-party licences since we distribute their jars with the pyspark package. ## How was this patch tested? Manually tested with python2.7 and python3.4 ```sh $ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean package $ cd python $ python setup.py sdist $ pip install dist/pyspark-2.1.0.dev0.tar.gz $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/ graphx mllib streaming $ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/ 600K /usr/local/lib/python2.7/dist-packages/pyspark/data/ $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/licenses/|head -5 LICENSE-AnchorJS.txt LICENSE-DPark.txt LICENSE-Mockito.txt LICENSE-SnapTree.txt LICENSE-antlr.txt ``` Author: Shuai Lin <[email protected]> Closes apache#16082 from lins05/include-data-in-pyspark-dist. (cherry picked from commit bd9a4a5) Signed-off-by: Sean Owen <[email protected]>
…rmatted string instead of millis ## What changes were proposed in this pull request? Easier to read while debugging as a formatted string (in ISO8601 format) than in millis ## How was this patch tested? Updated unit tests Author: Tathagata Das <[email protected]> Closes apache#16166 from tdas/SPARK-18734. (cherry picked from commit 539bb3c) Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request? Maven compilation seems to not allow resources in sql/test to be easily referred to in kafka-0-10-sql tests. So moved the kafka-source-offset-version-2.1.0 from sql test resources to kafka-0-10-sql test resources. ## How was this patch tested? Manually ran maven test Author: Tathagata Das <[email protected]> Closes apache#16183 from tdas/SPARK-18671-1. (cherry picked from commit 5c6bcdb) Signed-off-by: Tathagata Das <[email protected]>
…logit. ## What changes were proposed in this pull request? Several cleanup and improvements for ```spark.logit```: * ```summary``` should return coefficients matrix, and should output labels for each class if the model is multinomial logistic regression model. * ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrame which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0) which will be changed in later Spark version. In case it will introduce breaking changes, we do not expose them currently. * SparkR test improvement: comparing the training result with native R glmnet. * Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param(related with Spark architecture and job execution) that would be used rarely by R users. ## How was this patch tested? Unit tests. The ```summary``` output after this change: multinomial logistic regression: ``` > df <- suppressWarnings(createDataFrame(iris)) > model <- spark.logit(df, Species ~ ., regParam = 0.5) > summary(model) $coefficients versicolor virginica setosa (Intercept) 1.514031 -2.609108 1.095077 Sepal_Length 0.02511006 0.2649821 -0.2900921 Sepal_Width -0.5291215 -0.02016446 0.549286 Petal_Length 0.03647411 0.1544119 -0.190886 Petal_Width 0.000236092 0.4195804 -0.4198165 ``` binomial logistic regression: ``` > df <- suppressWarnings(createDataFrame(iris)) > training <- df[df$Species %in% c("versicolor", "virginica"), ] > model <- spark.logit(training, Species ~ ., regParam = 0.5) > summary(model) $coefficients Estimate (Intercept) -6.053815 Sepal_Length 0.2449379 Sepal_Width 0.1648321 Petal_Length 0.4730718 Petal_Width 1.031947 ``` Author: Yanbo Liang <[email protected]> Closes apache#16117 from yanboliang/spark-18686. (cherry picked from commit 90b59d1) Signed-off-by: Yanbo Liang <[email protected]>
Poisson GLM fails for many standard data sets (see example in test or JIRA). The issue is incorrect initialization leading to almost zero probability and weights. Specifically, the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close to zero probability and weights in the weighted least squares. Fix and test are included in the commits. ## What changes were proposed in this pull request? Update initialization in Poisson GLM ## How was this patch tested? Add test in GeneralizedLinearRegressionSuite srowen sethah yanboliang HyukjinKwon mengxr Author: actuaryzhang <[email protected]> Closes apache#16131 from actuaryzhang/master. (cherry picked from commit b828027) Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k. ## How was this patch tested? Existing test plus new test case. Author: Sean Owen <[email protected]> Closes apache#16129 from srowen/SPARK-18678. (cherry picked from commit 79f5f28) Signed-off-by: Sean Owen <[email protected]>
…esToBytesMap ## What changes were proposed in this pull request? BytesToBytesMap currently does not release the in-memory storage (the longArray variable) after it spills to disk. This is typically not a problem during aggregation because the longArray should be much smaller than the pages, and because we grow the longArray at a conservative rate. However, this can lead to an OOM when an already running task is allocated more than its fair share, which can happen because of a scheduling delay. In this case the longArray can grow beyond the fair share of memory for the task. This becomes problematic when the task spills and the long array is not freed; that causes subsequent memory allocation requests to be denied by the memory manager, resulting in an OOM. This PR fixes this issue by freeing the longArray when the BytesToBytesMap spills. ## How was this patch tested? Existing tests and tested on real-world workloads. Author: Jie Xiong <[email protected]> Author: jiexiong <[email protected]> Closes apache#15722 from jiexiong/jie_oom_fix. (cherry picked from commit c496d03) Signed-off-by: Herman van Hovell <[email protected]>
…y column is not attribute ## What changes were proposed in this pull request? Fixes AnalysisException for pivot queries that have group by columns that are expressions and not attributes by substituting the expressions output attribute in the second aggregation and final projection. ## How was this patch tested? existing and additional unit tests Author: Andrew Ray <[email protected]> Closes apache#16177 from aray/SPARK-17760. (cherry picked from commit f1fca81) Signed-off-by: Herman van Hovell <[email protected]>
## What changes were proposed in this pull request? It's better to add a warning log when skipping a corrupted file. It will be helpful when we want to finish the job first, then find them in the log and fix these files. ## How was this patch tested? Jenkins Author: Shixiong Zhu <[email protected]> Closes apache#16192 from zsxwing/SPARK-18764. (cherry picked from commit dbf3e29) Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request? When SSL is enabled, the Spark shell shows: ``` Spark context Web UI available at https://192.168.99.1:4040 ``` This is wrong because 4040 is http, not https. It redirects to the https port. More importantly, this introduces several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481. CC: mengxr liancheng I manually tested by accessing MasterPage, WorkerPage and HistoryServer with SSL enabled. Author: sarutak <[email protected]> Closes apache#16190 from sarutak/SPARK-18761. (cherry picked from commit bb94f61) Signed-off-by: Marcelo Vanzin <[email protected]>
…taLossSuite ## What changes were proposed in this pull request? Fixed the following failures: ``` org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 3745 times over 1.0000790851666665 minutes. Last failure message: assertion failed: failOnDataLoss-0 not deleted after timeout. ``` ``` sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: Query query-66 terminated with exception: null at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:146) Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null at java.util.ArrayList.addAll(ArrayList.java:577) at org.apache.kafka.clients.Metadata.getClusterForCurrentTopics(Metadata.java:257) at org.apache.kafka.clients.Metadata.update(Metadata.java:177) at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleResponse(NetworkClient.java:605) at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeHandleCompletedReceive(NetworkClient.java:582) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:450) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:269) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitPendingRequests(ConsumerNetworkClient.java:260) at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222) at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:366) at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:978) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938) at ... ``` ## How was this patch tested? Tested in apache#16048 by running many times. Author: Shixiong Zhu <[email protected]> Closes apache#16109 from zsxwing/fix-kafka-flaky-test. (cherry picked from commit edc87e1) Signed-off-by: Tathagata Das <[email protected]>
Based on an informal survey, users find this option easier to understand / remember. Author: Michael Armbrust <[email protected]> Closes apache#16182 from marmbrus/renameRecentProgress. (cherry picked from commit 70b2bf7) Signed-off-by: Tathagata Das <[email protected]>
… python example and document ## What changes were proposed in this pull request? Logistic Regression summary is added in Python API. We need to add example and document for summary. The newly added example is consistent with Scala and Java examples. ## How was this patch tested? Manually tests: Run the example with spark-submit; copy & paste code into pyspark; build document and check the document. Author: [email protected] <[email protected]> Closes apache#16064 from wangmiao1981/py. (cherry picked from commit aad1120) Signed-off-by: Joseph K. Bradley <[email protected]>
… should be sent only to the listeners in the same session as the query ## What changes were proposed in this pull request? Listeners added with `sparkSession.streams.addListener(l)` are added to a SparkSession. So events only from queries in the same session as a listener should be posted to the listener. Currently, all the events get rerouted through Spark's main listener bus, that is, - StreamingQuery posts event to StreamingQueryListenerBus. Only the queries associated with the same session as the bus post events to it. - StreamingQueryListenerBus posts event to Spark's main LiveListenerBus as a SparkEvent. - StreamingQueryListenerBus also subscribes to LiveListenerBus events thus getting back the posted event in a different thread. - The received event is posted to the registered listeners. The problem is that *all StreamingQueryListenerBuses in all sessions* get the events and post them to their listeners. This is wrong. In this PR, I solve it by making StreamingQueryListenerBus track active queries (by their runIds) when a query posts the QueryStarted event to the bus. This allows the rerouted events to be filtered using the tracked queries. Note that this list needs to be maintained separately from `StreamingQueryManager.activeQueries` because a terminated query is cleared from `StreamingQueryManager.activeQueries` as soon as it is stopped, but this ListenerBus must clear a query only after the termination event of that query has been posted lazily, much after the query has been terminated. Credit goes to zsxwing for coming up with the initial idea. ## How was this patch tested? Updated test harness code to use the correct session, and added new unit test. Author: Tathagata Das <[email protected]> Closes apache#16186 from tdas/SPARK-18758. (cherry picked from commit 9ab725e) Signed-off-by: Tathagata Das <[email protected]>
…or L1 and elastic-net ## What changes were proposed in this pull request? WeightedLeastSquares now supports L1 and elastic net penalties and has an additional solver option: QuasiNewton. The docs are updated to reflect this change. ## How was this patch tested? Docs only. Generated documentation to make sure Latex looks ok. Author: sethah <[email protected]> Closes apache#16139 from sethah/SPARK-18705. (cherry picked from commit 8225361) Signed-off-by: Yanbo Liang <[email protected]>
## What changes were proposed in this pull request? Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues: * Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```. Since it was used when making prediction and should be an argument of ```predict```, and we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle. * Fix ```spark.als``` params to make it consistent with MLlib. ## How was this patch tested? Existing tests. Author: Yanbo Liang <[email protected]> Closes apache#16169 from yanboliang/spark-18326. (cherry picked from commit 9725549) Signed-off-by: Yanbo Liang <[email protected]>
## What changes were proposed in this pull request? * Add all R examples for ML wrappers which were added during 2.1 release cycle. * Split the whole ```ml.R``` example file into individual example for each algorithm, which will be convenient for users to rerun them. * Add corresponding examples to ML user guide. * Update ML section of SparkR user guide. Note: MLlib Scala/Java/Python examples will be consistent, however, SparkR examples may different from them, since R users may use the algorithms in a different way, for example, using R ```formula``` to specify ```featuresCol``` and ```labelCol```. ## How was this patch tested? Run all examples manually. Author: Yanbo Liang <[email protected]> Closes apache#16148 from yanboliang/spark-18325. (cherry picked from commit 9bf8f3c) Signed-off-by: Yanbo Liang <[email protected]>
…ythonExec so input_file_name function can work with UDF in pyspark ## What changes were proposed in this pull request? `input_file_name` doesn't return filename when working with UDF in PySpark. An example shows the problem: from pyspark.sql.functions import * from pyspark.sql.types import * def filename(path): return path sourceFile = udf(filename, StringType()) spark.read.json("tmp.json").select(sourceFile(input_file_name())).show() +---------------------------+ |filename(input_file_name())| +---------------------------+ | | +---------------------------+ The cause of this issue is, we group rows in `BatchEvalPythonExec` for batching processing of PythonUDF. Currently we group rows first and then evaluate expressions on the rows. If the data is less than the required number of rows for a group, the iterator will be consumed to the end before the evaluation. However, once the iterator reaches the end, we will unset input filename. So the input_file_name expression can't return correct filename. This patch fixes the approach to group the batch of rows. We evaluate the expression first and then group evaluated results to batch. ## How was this patch tested? Added unit test to PySpark. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <[email protected]> Closes apache#16115 from viirya/fix-py-udf-input-filename. (cherry picked from commit 6a5a725) Signed-off-by: Wenchen Fan <[email protected]>
… records ## What changes were proposed in this pull request? Fixes a bug in the python implementation of rdd cartesian product related to batching that showed up in repeated cartesian products with seemingly random results. The root cause being multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching. `CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks. `PairDeserializer` no longer extends `CartesianDeserializer` as it was not really proper. If wanted a new common super class could be added. Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization. ## How was this patch tested? Additional unit tests (sourced from apache#14248) plus one for testing a cartesian with zip. Author: Andrew Ray <[email protected]> Closes apache#16121 from aray/fix-cartesian. (cherry picked from commit 3c68944) Signed-off-by: Davies Liu <[email protected]>
This PR has 2 key changes. One, we are building source package (aka bundle package) for SparkR which could be released on CRAN. Two, we should include in the official Spark binary distributions SparkR installed from this source package instead (which would have help/vignettes rds needed for those to work when the SparkR package is loaded in R, whereas earlier approach with devtools does not) But, because of various differences in how R performs different tasks, this PR is a fair bit more complicated. More details below. This PR also includes a few minor fixes. These are the additional steps in make-distribution; please see [here](https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md) on what's going to a CRAN release, which is now run during make-distribution.sh. 1. package needs to be installed because the first code block in vignettes is `library(SparkR)` without lib path 2. `R CMD build` will build vignettes (this process runs Spark/SparkR code and captures outputs into pdf documentation) 3. `R CMD check` on the source package will install package and build vignettes again (this time from source packaged) - this is a key step required to release R package on CRAN (will skip tests here but tests will need to pass for CRAN release process to success - ideally, during release signoff we should install from the R source package and run tests) 4. `R CMD Install` on the source package (this is the only way to generate doc/vignettes rds files correctly, not in step # 1) (the output of this step is what we package into Spark dist and sparkr.zip) Alternatively, R CMD build should already be installing the package in a temp directory though it might just be finding this location and set it to lib.loc parameter; another approach is perhaps we could try calling `R CMD INSTALL --build pkg` instead. But in any case, despite installing the package multiple times this is relatively fast. Building vignettes takes a while though. Manually, CI. Author: Felix Cheung <[email protected]> Closes apache#16014 from felixcheung/rdist. (cherry picked from commit c3d3a9d) Signed-off-by: Shivaram Venkataraman <[email protected]>
…Utils.tryOrStopSparkContext ## What changes were proposed in this pull request? When `SparkContext.stop` is called in `Utils.tryOrStopSparkContext` (the following three places), it will cause deadlock because the `stop` method needs to wait for the thread running `stop` to exit. - ContextCleaner.keepCleaning - LiveListenerBus.listenerThread.run - TaskSchedulerImpl.start This PR adds `SparkContext.stopInNewThread` and uses it to eliminate the potential deadlock. I also removed my changes in apache#15775 since they are not necessary now. ## How was this patch tested? Jenkins Author: Shixiong Zhu <[email protected]> Closes apache#16178 from zsxwing/fix-stop-deadlock. (cherry picked from commit 26432df) Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request? This patch fixes the format specification in explain for file sources (Parquet and Text formats are the only two that are different from the rest): Before: ``` scala> spark.read.text("test.text").explain() == Physical Plan == *FileScan text [value#15] Batched: false, Format: org.apache.spark.sql.execution.datasources.text.TextFileFormatxyz, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string> ``` After: ``` scala> spark.read.text("test.text").explain() == Physical Plan == *FileScan text [value#15] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/scratch/rxin/spark/test.text], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string> ``` Also closes apache#14680. ## How was this patch tested? Verified in spark-shell. Author: Reynold Xin <[email protected]> Closes apache#16187 from rxin/SPARK-18760. (cherry picked from commit 5f894d2) Signed-off-by: Reynold Xin <[email protected]>
This PR changes the SparkR source release tarball to be built using the Hadoop 2.6 profile. Previously it was using the without hadoop profile which leads to an error as discussed in apache#16014 (comment) Author: Shivaram Venkataraman <[email protected]> Closes apache#16218 from shivaram/fix-sparkr-release-build. (cherry picked from commit 202fcd2) Signed-off-by: Shivaram Venkataraman <[email protected]>
…st. (apache-spark-on-k8s#378) * Retry binding server to random port in the resource staging server test. * Break if successful start * Start server in try block. * Fix scalastyle * More rigorous cleanup logic. Increment port numbers. * Move around more exception logic. * More exception refactoring. * Remove whitespace * Fix test * Rename variable
* set RestartPolicy=Never for executor In the current implementation the RestartPolicy of the executor pod is not set, so the default value "OnFailure" is in effect. But this causes a problem. If an executor is terminated unexpectedly, for example, exits with a java.lang.OutOfMemoryError, it'll be restarted by k8s with the same executor ID. When the new executor tries to fetch a block held by the last executor, ShuffleBlockFetcherIterator.splitLocalRemoteBlocks() thinks it's a **local** block and tries to read it from its local dir. But the executor's local dir has changed because a randomly generated ID is part of the local dir. A FetchFailedException is raised and the stage fails. The rolling error message: 17/06/29 01:54:56 WARN KubernetesTaskSetManager: Lost task 0.1 in stage 2.0 (TID 7, 172.16.75.92, executor 1): FetchFailed(BlockManagerId(1, 172.16.75.92, 40539, None), shuffleId=2, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: /data2/spark/blockmgr-0e228d3c-8727-422e-aa97-2841a877c42a/32/shuffle_2_0_0.index (No such file or directory) at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) * Update KubernetesClusterSchedulerBackend.scala
…he-spark-on-k8s#383) This makes executors consistent with the driver. Note that SPARK_EXTRA_CLASSPATH isn't set anywhere by Spark itself, but it's primarily meant to be set by images that inherit from the base driver/executor images.
Just some small changes but overall the design looks sound. The idea of having a separate orchestrator for "inner" steps is one that we shouldn't use too liberally - one could imagine a convoluted case where we nest steps excessively and we have a deep tree of execution steps. But in this case I think it's fine for now. Probably a good rule of thumb is to max out the steps depth to 2.
override def bootstrapMainContainerAndVolumes(
    originalPodWithMainContainer: PodWithMainContainer)
  : PodWithMainContainer = {
  import scala.collection.JavaConverters._
Always put imports at the top.
import scala.collection.JavaConverters._
logInfo("HADOOP_CONF_DIR defined. Mounting HDFS specific .xml files")
val keyPaths = hadoopConfigFiles.map(file =>
  new KeyToPathBuilder().withKey(file.toPath.getFileName.toString)
For these builders every method invoked on the builder should be on a separate line. See the examples from the rest of the submission client code.
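A minimal sketch of the requested formatting, with one fluent builder call per line. The helper object, the use of `Seq[File]`, and the `withPath` value are illustrative assumptions, not the PR's actual code:

```scala
import java.io.File
import io.fabric8.kubernetes.api.model.{KeyToPath, KeyToPathBuilder}

// Illustrative only: each builder method on its own line, as elsewhere in
// the submission client code. Mapping each Hadoop config file to a
// KeyToPath entry keyed (and pathed) by its file name is an assumption.
object KeyToPathFormattingSketch {
  def keyPaths(hadoopConfigFiles: Seq[File]): Seq[KeyToPath] =
    hadoopConfigFiles.map { file =>
      new KeyToPathBuilder()
        .withKey(file.toPath.getFileName.toString)
        .withPath(file.toPath.getFileName.toString)
        .build()
    }
}
```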
@@ -148,7 +148,9 @@ private[spark] class Client(
  }

private[spark] object Client {
  def run(sparkConf: SparkConf, clientArguments: ClientArguments): Unit = {
  def run(sparkConf: SparkConf,
      clientArguments: ClientArguments,
De-Indent by one
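For reference, one plausible reading of that indentation fix is Spark's usual multi-line signature style, sketched below. The stand-in `ClientArguments` case class is hypothetical, and the real method's remaining parameters are elided:

```scala
package org.apache.spark.deploy.kubernetes.submit

import org.apache.spark.SparkConf

// Hypothetical stand-in for the fork's real ClientArguments class.
case class ClientArguments(mainClass: String, driverArgs: Seq[String])

private[spark] object Client {
  // Continuation parameters indented one level in from `def`.
  def run(
      sparkConf: SparkConf,
      clientArguments: ClientArguments): Unit = {
    // body unchanged; additional parameters from the real change are elided
  }
}
```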
import org.apache.spark.deploy.kubernetes.submit.submitsteps.hadoopsteps.{HadoopConfigSpec, HadoopConfigurationStep}

/**
 * Configures the driverSpec that bootstraps dependencies into the driver pod.
Comment is possibly out of date - specify Hadoop somewhere in here.
private[spark] class HadoopConfBootstrapImpl(
    hadoopConfConfigMapName: String,
    hadoopConfigFiles: Array[File]) extends HadoopConfBootstrap with Logging{
I have a very small slight preference for `Seq` or `List` over `Array` but that's just me - they're probably very similar in practice.
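A small sketch of what that preference would look like for the constructor above. The stand-in trait and the `Sketch` suffixes are assumptions so the snippet compiles on its own; this is not the PR's code:

```scala
package org.apache.spark.deploy.kubernetes

import java.io.File

import org.apache.spark.internal.Logging

// Stand-in trait; the fork's real HadoopConfBootstrap trait lives elsewhere
// in this package.
trait HadoopConfBootstrapSketch

// Same constructor shape as the diff above, but taking a Seq[File]
// instead of an Array[File], per the reviewer's preference.
private[spark] class HadoopConfBootstrapImplSketch(
    hadoopConfConfigMapName: String,
    hadoopConfigFiles: Seq[File])
  extends HadoopConfBootstrapSketch with Logging
```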
private[spark] val KUBERNETES_KERBEROS_SUPPORT =
  ConfigBuilder("spark.kubernetes.kerberos")
    .doc("Specify whether your job is a job " +
      "that will require a Delegation Token to access HDFS")
If the entire doc can fit on one line then make it fit on one line. Otherwise break the line as far to the right of the sentence as possible.
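A sketch of the suggested change, assuming this entry is defined with Spark's internal `ConfigBuilder` as a boolean conf with a `false` default (the conf type, default, and wrapping object are assumptions, not taken from the PR):

```scala
package org.apache.spark.deploy.kubernetes

import org.apache.spark.internal.config.ConfigBuilder

object KerberosConfigSketch {
  // Doc string kept on a single line, since it fits.
  private[spark] val KUBERNETES_KERBEROS_SUPPORT =
    ConfigBuilder("spark.kubernetes.kerberos")
      .doc("Specify whether your job requires a Delegation Token to access HDFS.")
      .booleanConf
      .createWithDefault(false)
}
```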
@@ -94,6 +99,22 @@ private[spark] class DriverConfigurationStepsOrchestrator(
      submissionSparkConf)
    val kubernetesCredentialsStep = new DriverKubernetesCredentialsStep(
      submissionSparkConf, kubernetesResourceNamePrefix)
    val hadoopConfigurations = hadoopConfDir.map(conf => getHadoopConfFiles(conf))
Can we parse out the hadoop configuration files in the step and not in the orchestrator?
Or perhaps in the Hadoop orchestrator.
    pythonStep.toSeq
  }
  private def getHadoopConfFiles(path: String) : Array[File] = {
    def isFile(file: File) = if (file.isFile) Some(file) else None
Avoid inner methods - move this outside, but in this case it's probably fine to inline the code in the lambda below.
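One way that suggestion could look, with the inner `isFile` helper inlined into a plain filter; the wrapping object, the switch to `Seq[File]` (in line with the earlier `Seq` over `Array` comment), and the example path are assumptions, not the PR's final code:

```scala
import java.io.File

object HadoopConfFilesSketch {
  // Returns the plain files directly under `path`; no inner helper method.
  private def getHadoopConfFiles(path: String): Seq[File] = {
    val dir = new File(path)
    if (dir.isDirectory) {
      dir.listFiles.toSeq.filter(_.isFile)
    } else {
      Seq.empty[File]
    }
  }

  def main(args: Array[String]): Unit =
    // "/etc/hadoop/conf" is only an example HADOOP_CONF_DIR location.
    getHadoopConfFiles("/etc/hadoop/conf").foreach(f => println(f.getName))
}
```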
  extends DriverConfigurationStep {

  override def configureDriver(driverSpec: KubernetesDriverSpec): KubernetesDriverSpec = {
    import scala.collection.JavaConverters._
Imports at the top of the file.
    runSparkApplicationAndVerifyCompletion(
      JavaMainAppResource(CONTAINER_LOCAL_MAIN_APP_RESOURCE),
      SPARK_PI_MAIN_CLASS,
      Seq("HADOOP_CONF_DIR defined. Mounting HDFS specific .xml files", "Pi is roughly 3"),
Would like to see a test that's specific to handling HDFS in a way that's dependent on the Hadoop configurations. But it might be hard to write such a test. We probably need a new test job.
@@ -361,6 +362,7 @@ private[spark] class KubernetesClusterSchedulerBackend(
   * @return A tuple of the new executor name and the Pod data structure.
   */
  private def allocateNewExecutorPod(nodeToLocalTaskCount: Map[String, Int]): (String, Pod) = {
    import scala.collection.JavaConverters._
Move import to the top.
import org.apache.spark.{SparkContext, SparkEnv, SparkException}
import org.apache.spark.deploy.kubernetes.{ConfigurationUtils, InitContainerResourceStagingServerSecretPlugin, PodWithDetachedInitContainer, SparkPodInitContainerBootstrap}
import org.apache.spark.deploy.kubernetes._
Avoid wildcard imports. Exceptions would be our `config` and `constants` and `JavaConverters`.
Please fix this everywhere - I think all files affected by this change had changed imports. Perhaps check IDE settings - I know for IntelliJ I had to change the import settings to make it never import with wildcards.
This will get clunky, as we are importing a ton of classes, and in the Orchestrator a ton of step classes.
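For illustration, the import style the reviewers converge on might look roughly like this; class and package names are taken from the snippet above and assume the fork's `org.apache.spark.deploy.kubernetes` packages are on the classpath:

```scala
// Illustrative only: explicit imports, with wildcards reserved for the
// project's config, constants, and scala.collection.JavaConverters.
import scala.collection.JavaConverters._

import org.apache.spark.{SparkContext, SparkEnv, SparkException}
import org.apache.spark.deploy.kubernetes.{ConfigurationUtils, PodWithDetachedInitContainer}
import org.apache.spark.deploy.kubernetes.config._
import org.apache.spark.deploy.kubernetes.constants._
```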
Moved to #391
What changes were proposed in this pull request?
This is the ongoing work of setting up Secure HDFS interaction with Spark-on-K8S.
The architecture is discussed in this community-wide Google doc.
This initiative can be broken down into 4 stages.
STAGE 1
Detecting the `HADOOP_CONF_DIR` environmental variable and using Config Maps to store all Hadoop config files locally, while also setting `HADOOP_CONF_DIR` locally in the driver / executors. A rough sketch of this step follows the list of stages.

STAGE 2
Grabbing the `TGT` from `LTC` or using keytabs + principal and creating a `DT` that will be mounted as a secret.

STAGE 3

STAGE 4
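As a rough sketch of what Stage 1 amounts to (illustrative, not the PR's implementation): read every plain file under `HADOOP_CONF_DIR` into a single ConfigMap that the driver and executor pods can mount, then point `HADOOP_CONF_DIR` at the mount path. The object and method names below are assumptions:

```scala
import java.io.File
import scala.collection.JavaConverters._

import io.fabric8.kubernetes.api.model.{ConfigMap, ConfigMapBuilder}

object HadoopConfConfigMapSketch {
  // Pack the contents of every file in hadoopConfDir into one ConfigMap,
  // keyed by file name (e.g. core-site.xml, hdfs-site.xml).
  def buildConfigMap(name: String, hadoopConfDir: String): ConfigMap = {
    val confFiles = new File(hadoopConfDir).listFiles.toSeq.filter(_.isFile)
    val data = confFiles.map { file =>
      val source = scala.io.Source.fromFile(file)
      try file.getName -> source.mkString finally source.close()
    }.toMap
    new ConfigMapBuilder()
      .withNewMetadata()
        .withName(name)
        .endMetadata()
      .withData(data.asJava)
      .build()
  }
}
```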
How was this patch tested?